Phillip (Armand) Bester is a medical scientist, researcher, and lecturer at the Division of Virology, University of the Free State, and National Health Laboratory Service (NHLS), Bloemfontein, South Africa

Andrie de Vries is the author of “R for Dummies” and a Solutions Engineer at RStudio

Introduction

In part 2 of this series, we discussed the PhyloPi pipeline for conducting routine HIV phylogenetics in the drug resistance testing laboratory as a part of quality control. As mentioned, during HIV replication the error-prone viral reverse transcriptase (RT) converts its RNA genome into DNA before it can be integrated into the host cell genome. During this conversion, the enzyme makes random mistakes in the copying process. These mistakes or mutations can be deleterious, beneficial or may have no measurable impact on the replicative fitness of the virus. This fast rate of mutation provides enough divergence to be useful for phylogenetic analysis. As we will see in the 4th part of this series, the intrapatient divergence is smaller than the interpatient divergence. As infections spread from person to person the virus will continue to mutate and become more and more divergent. This allows us to use the genetic information we obtain while doing the drug resistance test and analyse the sequences for abnormalities.

Cuevas et al. (2015) published work on the in vivo rate of HIV evolution. Their analysis revealed the highest mutation rate of any biological entity of \(4.1 \cdot 10^{-3} (sd=1.7 \cdot 10^{-3})\). However, the error-prone reverse transcriptase is not the only mechanism of mutation. One defence against HIV infection is an enzyme called apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like or APOBEC. These enzymes act on RNA and converts or mutates cytidine to uridine (uridine in RNA is the thymadine couterpart in DNA). This results in a G to A mutations the the cDNA.

Also, shown by Cuevas et al, these enzymes are not equally active in all people. On the other hand, the viral Vif protein inhibits this hypermutation by ‘tagging’ the APOBEC protein with ubiquinone for degradation by the cytoplasmic ubiquitin-dependent proteasome machinery.

But how does this hypermutation affect the virus in a negative way?

We first need to understand how RNA is translated into proteins. Below is a table showing the codon combinations for each of the 20 amino acids.

As can be seen from the table above, some amino acids are encoded by more than one codon. For example, if we change the codon, CGU to AGA the resulting amino acid stays Arginine or R. This is referred to as a silent mutation since the resulting protein will look the same. On the other hand, if we mutate AGU to CGU the resulting mutation is from Serine to Arginine or in single letter notation, S to R. A change in the amino acid is referred to as a non-synonymous mutation.

In reality, the APOBEC enzyme recognizes specific RNA sequence motifs, but just to give an idea of how this works, let’s look at an example.

Load some packages

library(ape)
library(Biostrings)
library(tibble)
library(tidyr)
library(dplyr)
library(knitr)
library(plotly)
library(RColorBrewer)
library(diagram)

Create a RNA sequence, remeber U is T in RNA langauge.

WT <- c("CGA", "GUU", "AUA", "GAG", "UGG", "AGU")

We have the sequence CGAGUUAUAGAGUGGAGU which we created in the cell block above as codons for clarity. We can now translate this sequence using the codon table or some function.

AA <- trans(as.DNAbin(DNAString(gsub("U", x = paste0(WT, collapse = ""), replacement = "T" ))))

AA <- as.character.AAbin(AA)[[1]]

The code block above translated our RNA sequence into a protein sequence: R, V, I, E, W, S

Now let’s mutate all occurrences of C to U/T

MUT <- gsub("C", "U", WT)

The resulting mutant sequence is: UGA, GUU, AUA, GAG, UGG, AGU and if we now translate that we get …

AA <- trans(as.DNAbin(DNAString(gsub("U", x = paste0(MUT, collapse = ""), replacement = "T" ))))

AA <- as.character.AAbin(AA)[[1]]

… the protein sequence: *, V, I, E, W, S.

The * means a stop codon was introduced. Stop codons are responsible for terminating translation from RNA to protein. If one of the viral genes has a stop codon in it the protein will truncate prematurely and the protein will most likely be dysfunctional. Mutations other than stop codons could also have a negative effect on the virus or it can cause resistance to an ARV.

Calculating genetic distances from a multiple sequence alingment (msa)

In the previous seqtion we showed the general principle of a msa. In biology sequence alingments are used to look at similarties of DNA or protein sequences. For most phylogenetic analysis a mutliple sequence alingment is a requirement and the more accurate the msa, the more accurate the phylogenetic inference.

Read in the multiple sequence alignment file.

# Read in the alignment file
aln <- read.dna('example.aln', format = 'fasta')

Next we can calculate the distance matrix using the Kimura 2-parameter (K80) model. There are various models which can be applied when looking at DNA substitution models, but first lets look at Marcov chains.

Below we have a table showing the probability of changing from one weather state to another depending on the current state.

tmA <- matrix(c(0.25,0.65,0.1,.25,0.25,.5,.35,.25,0.4),nrow = 3, byrow = TRUE)
stateNames <- c("No Rain","Light Rain","Heavy Rain")
row.names(tmA) <- stateNames; colnames(tmA) <- stateNames

tmA %>% 
  kable(
    caption = "Probabilities of weather transitions"
  )
Probabilities of weather transitions
No Rain Light Rain Heavy Rain
No Rain 0.25 0.65 0.1
Light Rain 0.25 0.25 0.5
Heavy Rain 0.35 0.25 0.4

This example is borrowed from RPubs, thank you Janpu Hou for a very clean and clearly explained blog post on the subject.

Next, we can plot our probabilities from above.

plotmat(tmA,pos = c(1,2), 
        lwd = 1, box.lwd = 2, 
        cex.txt = 0.8, 
        box.size = 0.1, 
        box.type = "circle", 
        box.prop = 0.5,
        box.col = "light blue",
        arr.length=.1,
        arr.width=.1,
        self.cex = .6,
        self.shifty = -.01,
        self.shiftx = .14,
        main = "Markov Chain")

As you can see from the table and the diagram above, we have three states (nodes) in our example and the probability of transition or staying in a stage is indicated by the edges (lines).

A big part of a scientist’s job is to observe nature and then try to apply a model to the observation. Think of Issac Newton and Albert Einstein, both of them had a very good model for gravity. Arguably, Newton’s model is easier to implement when you want to calculate a trajectory for launching a rocket and Einsteins model can explain why black holes bent light. It is time for a quote:

“All models are wrong, but some are useful” - George Box

This is very true when it comes to estimating genetic distances and phylogentic inference. Consider the image below.

The figure above shows transition and transversion events. Transition between A and G (the purines) and C and T (the pyrimidines) are more likely than transversions indicated by the red arrows. The K80 model takes this into account as one of its parameters and these rates or probabilities are calculated or estimated by maximum likelihood.

Lets see what that looks like.

tmDNA <- matrix(c(0.8,0.05,0.1,0.05,
                  0.05,0.8,0.05,0.1,
                  0.1,0.05,0.8,0.05,
                  0.05,0.1,0.05,0.8),
                nrow = 4, byrow = TRUE)
stateNames <- c("A","C","G", "T")
row.names(tmDNA) <- stateNames; colnames(tmDNA) <- stateNames

tmDNA %>% 
  kable(
    caption = "Example K80 probabilities of transitions or transversions"
  )
Example K80 probabilities of transitions or transversions
A C G T
A 0.80 0.05 0.10 0.05
C 0.05 0.80 0.05 0.10
G 0.10 0.05 0.80 0.05
T 0.05 0.10 0.05 0.80
plotmat(tmDNA,pos = c(2,2), 
        lwd = 1, box.lwd = 2, 
        cex.txt = 0.8, 
        box.size = 0.1, 
        box.type = "circle", 
        box.prop = 0.5,
        box.col = "light blue",
        arr.length=.1,
        arr.width=.1,
        self.cex = .6,
        self.shifty = -.01,
        self.shiftx = .14,
        main = "Markov Chain")

The example above is a made up one but should explain the concept of a substitutions model. The viral reverse transcriptase is not a random sequence generator, but it does make mistakes. Most of the time when it is copying the RNA into DNA the base (state) stays the same. Then also the probability of a transversion vs a transition is different. If you look at the figure above where we introduced transversion and transition you will notice, that A is more similar to G and T is more similar to C in its chemical structure.

There are many other substitution models, take a look at http://www.iqtree.org/doc/Substitution-Models. It is not always trivial to select the best model for phylogenetic inference. One technique is to run multiple maximum likelihood phylogenetic calculations using different models and then pick the model with the lowest AIC (Akaike Information Criterion). For our pipeline, we selected the rather simple K80 model. Since we are looking at different sets of sequences at each submission, a simple model is probably better in order to avoid the problems caused by overfitting.

We can use the ape package and do the distance calculations using the K80 model.

# Calculate the genetic distances between sequences using the K80 model, as.mattrix makes the rest easier
alnDist <- dist.dna(aln, model = "K80", as.matrix = TRUE)
alnDist[1:5, 1:5] %>% 
  kable(caption = "First few rows of our distance matrix")
First few rows of our distance matrix
01_AE.JP.AB253686_INT B.US.HM450245_INT B.AU.AF407664_INT B.CN.KJ820110_INT B.RU.HM466986_INT
01_AE.JP.AB253686_INT 0.0000000 0.0935626 0.0961965 0.0962887 0.0962887
B.US.HM450245_INT 0.0935626 0.0000000 0.0378446 0.0378167 0.0378748
B.AU.AF407664_INT 0.0961965 0.0378446 0.0000000 0.0454602 0.0494138
B.CN.KJ820110_INT 0.0962887 0.0378167 0.0454602 0.0000000 0.0479955
B.RU.HM466986_INT 0.0962887 0.0378748 0.0494138 0.0479955 0.0000000

The matirx has a shape of 47 by 47, we just preview the first 5 rows and columns.

Reduction of the heatmap to focus on the important data

The pipeline mentioned uses the Basic Local Alignment Search Tool (BLAST) to retrieve previously sampled sequences and adds these retrieved sequences to the analysis. BLAST is like a search engine you use on the web, but for protein or DNA sequences. By doing this important sequences from retrospective samples are included which enables PhyloPi to be aware of past sequences and not just batch per batch aware. Have a look at the paper for some examples.

The data we have is ready to use for heatmap plotting purposes, but since the data also contains previously sampled sequences, comparing those sequences amongst themselves would be a distraction. We are interested in those samples but only compared to the current batch of samples analysed. The figures below should explain this a bit better.


A diagram of a heatmap with lots of redundant and distracting data.

A diagram of a heatmap with lots of redundant and distracting data.


From the image above you can see that, typical of a heatmap, it is symmetrical on the diagonal. We show submitted vs retrieved samples in both the horizontal and vertical direction. Notice also, as annotated as “Distraction”, are the previous samples compared amongst themselves. We are not interested in those samples now as we would already have acted on any issues then. What we want instead, is a heatmap as depicted in the image below.


A diagram of a more focussed heatmap with the redundant and distracting data removed.

A diagram of a more focussed heatmap with the redundant and distracting data removed.


Luckily for us, we have a very powerful tool at our disposal, R, and plenty of really useful and convenient packages to fix this, like dplyr.

alnDistLong <- 
  alnDist %>% 
  as.data.frame(stringsToFactors = FALSE) %>% 
  rownames_to_column(var = "sample_1") %>% 
  gather(key = "sample_2", value = "distance", -sample_1, na.rm = TRUE) %>% 
  arrange(distance)

Create a new variable, combined, we will paste the names for sample_1 and sample_2 together

Final cleanup and removal of distracting data

# get the names of samples originally in the fasta file used for submission
qSample <- names(read.dna("example.fasta", format = "fasta"))

# compute new order of samples, so the new alignment is in the order of the heatmap example
sample_1 <- unique(alnDistLong$sample_1)
new_order <- c(sort(qSample), setdiff(sample_1, qSample))

Plot the heatmap using plotly for interactivity

alnDistLong %>% 
  filter(
    sample_1 %in% qSample,
    sample_1 != sample_2
    ) %>% 
  mutate(
    sample_2 = factor(sample_2, levels = new_order)
  ) %>% 
  plot_ly(
    x = ~sample_2,
    y = ~sample_1,
    z = ~distance,
    type = "heatmap", colors = brewer.pal(11, "RdYlBu"), 
    zmin = 0.0, zmax = 0.03,  xgap = 2, ygap = 1
) %>% 
  layout(
    margin = list(l = 100, r = 10, b = 100, t = 10, pad = 4), 
    yaxis = list(tickfont = list(size = 10), showspikes = TRUE),
    xaxis = list(tickfont = list(size = 10), showspikes = TRUE)
  )

Phylogenetic tree

Above we used the package ape to calculate the genetic distances for the heatmap.

Another way of looking at our alignment data is to use phylogenetic inference. The PhyloPi pipeline saves each step of phylogenetic inference to allow the user to intercept at any step. We can use the newick tree file (a text file formatted as newick) and draw our own tree.

tree <- read.tree("example-tree.txt")
plot.phylo(
  tree, cex = 0.8, 
  use.edge.length = TRUE, 
  tip.color = 'blue', 
  align.tip.label = FALSE, 
  show.node.label = TRUE
)
nodelabels("This one", 9, frame = "r", bg = "red", adj = c(-8.2,-46))

We have highlighted a node with a red block, with the text “This one” which we can now discuss. We have three leaves in this node, KM050043, KM050042, KM050041 and if you would look up these accession numbers at NCBI you will notice the Title of the paper it is tied to:

“HIV transmission. Selection bias at the heterosexual HIV-1 transmission bottleneck”

In this paper, the authors looked and selection bias when the infection is transmitted. They found that in a pool of viral quasispecies transmission is biased to benefit the fittest viral quasispecies. The node highlighted above shows the kind of clustering one would expect with a study like the one mentioned above. You will also notice plenty of other nodes which you can explore using the accession number and searching for it at https://www.hiv.lanl.gov/components/sequence/HIV/search/search.html

The tree above is much like a dendrogram used when displaying algoromative or hierarchical clustering. The numbers on the tree above show the probability that the corresponding clusters are correct. The branch lengths indicate the distances between samples. This in conjunction with a properly coloured heatmap is very useful for finding relevant clusters to investigate. If the reason for close clustering cannot be explained, the tests are repeated.

The R markdown document and all the data needed to replicate these blog posts can be found at https://github.com/ArmandBester/R-in-medicine.

In the fourth and final part of this series, we will show how we analysed the inter- and intrapatient genetic distances of HIV sequences by logistic regression. This was useful in properly colouring our heatmap explained in this series. See you there!